This markdown document is a gentle introduction to the {sf} package. The intent is not to be exhaustive, but to demonstrate the typical GIS operations that {sf} lets us do within R, operations that are used often in the MUSA 508 assignments. There are many other fantastic {sf} resources out there as well.

This exercise picks up where the Week 1 {tidycensus} exercise left off: loading census data with geometry = TRUE to produce an {sf} object.
library(tidyverse)
library(tidycensus)
library(sf)
library(tmap) # mapping, install if you don't have it
set.seed(717)
Your Census API key should be loaded from the week 1 exercise. If you get an API error about the key, go back and look at the Week 1 markdown.
Getting data with {tidycensus}

We will use the same ACS variables as in the last exercise. The object called acs_vars keeps the variables we are interested in. We will also focus on specific tracts in the Mt. Airy neighborhood. Finally, we fetch the data using the get_acs() function. The code below is the same code used at the end of Week #1, with the exception of removing the last line calling st_as_sf(). We’ll discuss why later.
acs_vars <- c("B01001_001E", # ACS total Pop estimate
"B25002_001E", # Estimate of total housing units
"B25002_003E", # Number of vacant housing units
"B19013_001E", # Median HH Income ($)
"B02001_002E", # People describing themselves as "white alone"
"B06009_006E") # Total graduate or professional degree
myTracts <- c("42101023500",
"42101023600",
"42101023700",
"42101025300",
"42101025400",
"42101025500",
"42101025600",
"42101038800")
acsTractsPHL.2016.sf <- get_acs(geography = "tract",
year = 2016,
variables = acs_vars,
geometry = TRUE,
state = "PA",
county = "Philadelphia",
output = "wide") %>%
dplyr::select (GEOID, NAME, all_of(acs_vars)) %>%
rename (total_pop.2016 = B01001_001E,
total_HU.2016 = B25002_001E,
total_vacant.2016 = B25002_003E,
med_HH_Income.2016 = B19013_001E,
total_White.2016 = B02001_002E,
total_GradDeg.2016 = B06009_006E) %>%
mutate(vacancyPct.2016 = total_vacant.2016/total_HU.2016,
pctWhite.2016 = total_White.2016/total_pop.2016) %>%
mutate(mtAiry = ifelse(GEOID %in% myTracts, "MT AIRY", "REST OF PHILADELPHIA"))
{sf} and why spatial is special

There are two important additions in the above code that allow us to go from tabular data to spatial data. This is incredibly powerful, because once data has a spatial component it opens up a whole new world (pun intended!) of analytical possibilities. Instead of only analyzing how the variables are correlated, we can now measure how they are correlated over space: how near, how far, how clustered or dispersed. We have access to the entire spatial system that led to these data being what they are. To quote Tobler’s First Law of Geography, “everything is related to everything else, but near things are more related than distant things” (Wikipedia). The rest of this semester will use spatial relationships to better understand data for more effective public policy.
To understand how {sf} helps us do that, we will break it down using the acsTractsPHL.2016.sf object we just created with {tidycensus}. Compared to fetching tabular data, the one new addition to that code is:

The geometry = TRUE argument in the call to get_acs()

Using the geometry = TRUE argument in the get_acs() function does a lot of work for us. That argument tells {tidycensus} to call on the {tigris} package to download spatial data (think of a shapefile) from the U.S. Census TIGER/Line files for the level of detail requested (e.g. state, county, tract, etc.), join the tabular ACS data to the geometry using the GEOID field, and then return an {sf} object that R can use to map and analyze. Looking into the acsTractsPHL.2016.sf object, we can understand this better.
class(acsTractsPHL.2016.sf)
## [1] "sf" "data.frame"
First, we can see what class the acsTractsPHL.2016.sf object is. It is class sf first and then data.frame second. This means that you can write R code to treat this as a spatial class, but that you can also do a lot of the things you would normally do with a data.frame to it as well. That is very handy!
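This dual class is easy to see in a small sketch. The example below uses the North Carolina sample data bundled with {sf} as a stand-in for any sf object (so it runs without a Census API key; the AREA threshold is arbitrary): ordinary data.frame verbs like filter() and select() work, and the geometry comes along for the ride.

```r
library(sf)
library(dplyr)

# sf ships with a small shapefile of North Carolina counties
nc <- st_read(system.file("shape/nc.shp", package = "sf"), quiet = TRUE)

# data.frame verbs work on an sf object, and the geometry is "sticky":
# it survives select() even though we did not ask for it
nc_small <- nc %>%
  filter(AREA > 0.2) %>%
  select(NAME, AREA)

class(nc_small)  # still "sf" "data.frame"
names(nc_small)  # NAME, AREA, and the geometry column
```

The same pattern works on acsTractsPHL.2016.sf or any other sf object.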
If we print the acsTractsPHL.2016.sf object, it looks a lot like a table, but it has some extra information at the top indicating that it is an {sf} object with spatial attributes aside from the tabular data.
head(acsTractsPHL.2016.sf,2)
## Simple feature collection with 2 features and 11 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -75.23584 ymin: 39.96507 xmax: -75.22407 ymax: 39.99188
## Geodetic CRS: NAD83
## GEOID NAME
## 1 42101010200 Census Tract 102, Philadelphia County, Pennsylvania
## 2 42101011900 Census Tract 119, Philadelphia County, Pennsylvania
## total_pop.2016 total_HU.2016 total_vacant.2016 med_HH_Income.2016
## 1 3008 1483 225 16071
## 2 4738 2509 407 30854
## total_White.2016 total_GradDeg.2016 geometry
## 1 0 145 MULTIPOLYGON (((-75.23536 3...
## 2 47 175 MULTIPOLYGON (((-75.23367 3...
## vacancyPct.2016 pctWhite.2016 mtAiry
## 1 0.1517195 0.000000000 REST OF PHILADELPHIA
## 2 0.1622160 0.009919797 REST OF PHILADELPHIA
You will notice new information at the top of the printout describing the spatial side of the acsTractsPHL.2016.sf object. This includes:

- The first line tells you that you have a “Simple feature” collection with a certain number of features and fields. More on this later, but sf stands for “simple features”. We only have 2 features here because the code above limits the printout to two rows to save room.
- The Geometry type tells us that the data are represented as polygons: areas bounded by at least three sides. It could also be POINT or LINESTRING if the data were represented as points or lines.
- The Dimension tells us how the data are measured. It will typically always be “XY” in our class.
- The Bounding box (bbox) is the smallest rectangle that can be drawn to fit all of the data. The numbers tell us the coordinates of the sides of the bbox, and from those we can figure out where the corners are.
- The final bit of information is the Geodetic CRS, which stands for coordinate reference system.
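As a quick sketch of how the corners fall out of the bbox numbers, here is a toy polygon (made-up coordinates, so the example is self-contained) run through st_bbox():

```r
library(sf)

# a toy polygon standing in for any sf/sfc geometry
poly <- st_sfc(st_polygon(list(rbind(
  c(-75.28, 39.87), c(-74.96, 39.87), c(-74.96, 40.14),
  c(-75.28, 40.14), c(-75.28, 39.87)))), crs = 4326)

bb <- st_bbox(poly)  # named vector: xmin, ymin, xmax, ymax
bb

# the four corners are just combinations of those four numbers
corners <- rbind(lower_left  = c(bb[["xmin"]], bb[["ymin"]]),
                 lower_right = c(bb[["xmax"]], bb[["ymin"]]),
                 upper_right = c(bb[["xmax"]], bb[["ymax"]]),
                 upper_left  = c(bb[["xmin"]], bb[["ymax"]]))
corners
```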
You noticed in the bbox line that the box is measured in some unit; in this case it is familiar latitude and longitude. However, we can measure geographic space using thousands of different coordinate systems. This is a major theme in this course: there are many different ways to measure things in space, and each method has trade-offs. You will have to know the basic systems and how to move data between them. You could spend an entire career learning about coordinate systems, but for this course you will only need the basics. More on that below.
How does {sf} store geometry?

We just learned that an sf object contains spatial information in the form of coordinates. In the case of acsTractsPHL.2016.sf, the coordinates form polygons. To see how this is stored, we can look at the geometry column.
acsTractsPHL.2016.sf$geometry
## Geometry set for 384 features
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -75.28027 ymin: 39.86705 xmax: -74.95578 ymax: 40.13799
## Geodetic CRS: NAD83
## First 5 geometries:
## MULTIPOLYGON (((-75.23536 39.96851, -75.2357 39...
## MULTIPOLYGON (((-75.23367 39.99188, -75.22566 3...
## MULTIPOLYGON (((-75.17785 39.97425, -75.17378 3...
## MULTIPOLYGON (((-75.13877 39.97932, -75.13814 3...
## MULTIPOLYGON (((-75.13902 39.98876, -75.13835 3...
The geometry column is a special column in an sf object that contains the spatial information. It is stored as plain, human-readable text: MULTIPOLYGON (((-75.23536 39.96851, -75.2357 39… The specific format of this text is the Simple Features standard. This is a widely used standard, which means there are lots of resources about it and lots of tools/packages/software that implement it. It is also where the {sf} package gets its name! In the example above, the geometry column says that the object is a multipart polygon and then lists out all of the coordinates in longitude and latitude. For a very complex shape like a coastline, you can imagine that this column could contain millions of coordinates and become quite large! That will be a consideration during this course, so keep it in mind.
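Because the geometry text follows the Simple Features standard, {sf} can parse it directly. Here is a small sketch with a made-up, shortened WKT (well-known text) string; the coordinates are illustrative only:

```r
library(sf)

# a WKT string like the one printed above, shortened and made up
wkt <- "MULTIPOLYGON (((-75.235 39.968, -75.236 39.970, -75.233 39.971, -75.235 39.968)))"

geom <- st_as_sfc(wkt, crs = 4326)  # parse WKT text into a geometry column
class(geom)                         # "sfc_MULTIPOLYGON" "sfc"
st_as_text(geom)                    # and back to WKT text again
```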
If you look at what class the geometry column is, you can see that it is class "sfc_MULTIPOLYGON" "sfc". This is different from typical data.frame column classes such as character or numeric. The sfc in this special class stands for simple feature collection.
class(acsTractsPHL.2016.sf$geometry)
## [1] "sfc_MULTIPOLYGON" "sfc"
The geometry column as an sfc can stand on its own and be plotted and analyzed, but it isn’t very interesting without the associated data. Adding the interesting data is exactly what the geometry = TRUE argument of get_acs() does for us. Putting it together: data.frame + sfc = sf!
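That equation can be sketched directly: build an sfc of geometries, attach a plain data.frame with st_sf(), and you have an sf object. The two points and their attributes below are made up for illustration:

```r
library(sf)

# an sfc: two made-up points in lon/lat
pts <- st_sfc(st_point(c(-75.16, 39.95)),
              st_point(c(-75.19, 40.05)), crs = 4326)

# a plain data.frame of attributes for those two points
attrs <- data.frame(site = c("A", "B"), value = c(10, 20))

# data.frame + sfc = sf
combined <- st_sf(attrs, geometry = pts)
class(combined)  # "sf" "data.frame"
```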
The rabbit hole of understanding coordinate reference systems is very deep. We will only go down to the first level, but here are the things you need to start out with:
- All spatial data need a CRS; otherwise they cannot be measured in space and cannot be spatially analyzed.
- You need to know, or figure out, the CRS for all of the spatial data you work with. It is not enough to guess: figure it out by looking at data sources and by mapping the data to confirm.
- There are two main types of CRS: projected and unprojected (aka geographic) coordinate systems. Projected Coordinate Systems (PCS) are how we map/project the round earth onto a flat map. Common examples of PCS are the State Plane, UTM, and Mercator coordinate systems. Geographic Coordinate Systems (GCS) are how we locate things on the round-ish earth. Common examples are WGS84, GRS80, and NAD83. Every PCS is based on a GCS, but a GCS can be mapped on its own.
- The most common GCS is WGS84, also known as EPSG:4326. This is a very common CRS and the default for many mapping applications and mapping libraries like leaflet. You will use it very often.
- To measure the distance between things, use a PCS such as State Plane or UTM with units in meters or feet. Measuring distances in lat/lon with a GCS is difficult because the ground distance covered by a degree changes as you move north or south of the equator. A typical pattern is to project the data into UTM or similar for a step in the analysis and then move back to 4326 for mapping and other uses.
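The project-measure-return pattern can be sketched like this, using two made-up points near Philadelphia (EPSG:26918 is NAD83 / UTM zone 18N, the zone used later in this exercise):

```r
library(sf)

# two made-up points in lon/lat (EPSG:4326)
a <- st_sfc(st_point(c(-75.165, 39.952)), crs = 4326)
b <- st_sfc(st_point(c(-75.192, 40.050)), crs = 4326)

# project into UTM zone 18N, where the unit is the metre
a_utm <- st_transform(a, "EPSG:26918")
b_utm <- st_transform(b, "EPSG:26918")

st_distance(a_utm, b_utm)  # planar distance in metres

# then move back to 4326 for mapping and other uses
a_back <- st_transform(a_utm, "EPSG:4326")
```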
Here are some good places to start on these topics: Geocomputation with R, ESRI’s Geographic vs. Projected Coordinate Systems, and Introduction to Spatial Analysis in R.

Below we will go over some basic ways to interact with the CRS in {sf}.
The CRS of an sf object

We saw that the acsTractsPHL.2016.sf object has a CRS of NAD83 when we inspected it. This CRS was automatically set by {tidycensus} because it knows where the data come from. This will not always be the case! A format like a shapefile, which carries CRS information in its metadata, will likely be assigned the proper CRS on load, but if you load a CSV with point coordinates, {sf} will not guess which CRS it is. We will see how to assign a CRS soon.

To get a more in-depth view of the CRS information in acsTractsPHL.2016.sf, we use the st_crs() function in {sf}.

Note: the majority of functions you will use in the {sf} package start with the prefix st_. This is a convention from spatial databases that {sf} follows for consistency.
st_crs(acsTractsPHL.2016.sf)
## Coordinate Reference System:
## User input: NAD83
## wkt:
## GEOGCRS["NAD83",
## DATUM["North American Datum 1983",
## ELLIPSOID["GRS 1980",6378137,298.257222101,
## LENGTHUNIT["metre",1]]],
## PRIMEM["Greenwich",0,
## ANGLEUNIT["degree",0.0174532925199433]],
## CS[ellipsoidal,2],
## AXIS["latitude",north,
## ORDER[1],
## ANGLEUNIT["degree",0.0174532925199433]],
## AXIS["longitude",east,
## ORDER[2],
## ANGLEUNIT["degree",0.0174532925199433]],
## ID["EPSG",4269]]
This gives us two bits of info: the CRS as NAD83, and then the CRS in the wkt format, which stands for well-known text. The wkt is a human-readable version of the CRS specification with important information. There are a lot of numbers, but the parts important to us are the GEOGCRS at the start, telling us it is a GCS, and ID["EPSG",4269] at the end, giving us the EPSG code.
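Rather than reading the wkt by eye, {sf} can answer these questions directly. A small sketch using st_crs() on EPSG codes and st_is_longlat() to distinguish a GCS from a PCS:

```r
library(sf)

gcs <- st_crs("EPSG:4269")   # NAD83, the CRS of our tracts
pcs <- st_crs("EPSG:26918")  # NAD83 / UTM zone 18N

gcs$epsg            # pull the EPSG code out of the crs object
st_is_longlat(gcs)  # TRUE: geographic coordinates (a GCS)
st_is_longlat(pcs)  # FALSE: projected coordinates (a PCS)
```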
tm_shape(acsTractsPHL.2016.sf,
projection = st_crs(acsTractsPHL.2016.sf)) +
tm_fill() +
tm_borders() +
tm_layout(title= 'NAD83',
title.position = c('left', 'top'))
Moving our data into a different projection is usually easy in {sf} with the st_transform() function. Here we will re-project our data from the GCS of NAD83 to the PCS of UTM Zone 18N. UTM stands for Universal Transverse Mercator; specifically, we use the 18 North zone. The UTM PCS is a very useful projection!
Here is the code to reproject our data.
acsTractsPHL.2016.sf_UTM <- acsTractsPHL.2016.sf %>%
st_transform(crs = "EPSG:26918")
st_crs(acsTractsPHL.2016.sf_UTM)
## Coordinate Reference System:
## User input: EPSG:26918
## wkt:
## PROJCRS["NAD83 / UTM zone 18N",
## BASEGEOGCRS["NAD83",
## DATUM["North American Datum 1983",
## ELLIPSOID["GRS 1980",6378137,298.257222101,
## LENGTHUNIT["metre",1]]],
## PRIMEM["Greenwich",0,
## ANGLEUNIT["degree",0.0174532925199433]],
## ID["EPSG",4269]],
## CONVERSION["UTM zone 18N",
## METHOD["Transverse Mercator",
## ID["EPSG",9807]],
## PARAMETER["Latitude of natural origin",0,
## ANGLEUNIT["degree",0.0174532925199433],
## ID["EPSG",8801]],
## PARAMETER["Longitude of natural origin",-75,
## ANGLEUNIT["degree",0.0174532925199433],
## ID["EPSG",8802]],
## PARAMETER["Scale factor at natural origin",0.9996,
## SCALEUNIT["unity",1],
## ID["EPSG",8805]],
## PARAMETER["False easting",500000,
## LENGTHUNIT["metre",1],
## ID["EPSG",8806]],
## PARAMETER["False northing",0,
## LENGTHUNIT["metre",1],
## ID["EPSG",8807]]],
## CS[Cartesian,2],
## AXIS["(E)",east,
## ORDER[1],
## LENGTHUNIT["metre",1]],
## AXIS["(N)",north,
## ORDER[2],
## LENGTHUNIT["metre",1]],
## USAGE[
## SCOPE["Engineering survey, topographic mapping."],
## AREA["North America - between 78°W and 72°W - onshore and offshore. Canada - Nunavut; Ontario; Quebec. United States (USA) - Connecticut; Delaware; Maryland; Massachusetts; New Hampshire; New Jersey; New York; North Carolina; Pennsylvania; Virginia; Vermont."],
## BBOX[28.28,-78,84,-72]],
## ID["EPSG",26918]]
We can see the numerous additions to the CRS output, including the PROJCRS title instead of GEOGCRS and all of the projection parameters that map the round earth onto a flat plane. Comparing the map below to the NAD83 map shows a small difference in orientation, but it is not very noticeable. However, in the new NAD83 / UTM zone 18N version, we are in a CRS with a set length unit, LENGTHUNIT["metre",1], so that we can measure distance in an accurate way.
tm_shape(acsTractsPHL.2016.sf_UTM,
projection = st_crs(acsTractsPHL.2016.sf_UTM)) +
tm_fill() +
tm_borders() +
tm_layout(title= 'UTM 18N',
title.position = c('left', 'top'))
UTM is great for local- and regional-scale data that fit within a single zone. Sometimes we need data across a broader extent, and sometimes we need a CRS that is not in the EPSG system. Here we use a common ESRI-defined projection whose domain is the continental U.S. Using a non-EPSG authority is as easy as crs = "ESRI:102003".
acsTractsPHL.2016.sf_Albers <- acsTractsPHL.2016.sf %>%
st_transform(crs = "ESRI:102003")
st_crs(acsTractsPHL.2016.sf_Albers)
## Coordinate Reference System:
## User input: ESRI:102003
## wkt:
## PROJCRS["USA_Contiguous_Albers_Equal_Area_Conic",
## BASEGEOGCRS["NAD83",
## DATUM["North American Datum 1983",
## ELLIPSOID["GRS 1980",6378137,298.257222101,
## LENGTHUNIT["metre",1]]],
## PRIMEM["Greenwich",0,
## ANGLEUNIT["Degree",0.0174532925199433]]],
## CONVERSION["USA_Contiguous_Albers_Equal_Area_Conic",
## METHOD["Albers Equal Area",
## ID["EPSG",9822]],
## PARAMETER["Latitude of false origin",37.5,
## ANGLEUNIT["Degree",0.0174532925199433],
## ID["EPSG",8821]],
## PARAMETER["Longitude of false origin",-96,
## ANGLEUNIT["Degree",0.0174532925199433],
## ID["EPSG",8822]],
## PARAMETER["Latitude of 1st standard parallel",29.5,
## ANGLEUNIT["Degree",0.0174532925199433],
## ID["EPSG",8823]],
## PARAMETER["Latitude of 2nd standard parallel",45.5,
## ANGLEUNIT["Degree",0.0174532925199433],
## ID["EPSG",8824]],
## PARAMETER["Easting at false origin",0,
## LENGTHUNIT["metre",1],
## ID["EPSG",8826]],
## PARAMETER["Northing at false origin",0,
## LENGTHUNIT["metre",1],
## ID["EPSG",8827]]],
## CS[Cartesian,2],
## AXIS["(E)",east,
## ORDER[1],
## LENGTHUNIT["metre",1]],
## AXIS["(N)",north,
## ORDER[2],
## LENGTHUNIT["metre",1]],
## USAGE[
## SCOPE["Not known."],
## AREA["United States (USA) - CONUS onshore - Alabama; Arizona; Arkansas; California; Colorado; Connecticut; Delaware; Florida; Georgia; Idaho; Illinois; Indiana; Iowa; Kansas; Kentucky; Louisiana; Maine; Maryland; Massachusetts; Michigan; Minnesota; Mississippi; Missouri; Montana; Nebraska; Nevada; New Hampshire; New Jersey; New Mexico; New York; North Carolina; North Dakota; Ohio; Oklahoma; Oregon; Pennsylvania; Rhode Island; South Carolina; South Dakota; Tennessee; Texas; Utah; Vermont; Virginia; Washington; West Virginia; Wisconsin; Wyoming."],
## BBOX[24.41,-124.79,49.38,-66.91]],
## ID["ESRI",102003]]
Compared to the UTM and NAD83 versions, you can see Broad St. is much closer to a north/south alignment. This is a pretty good projection for Philadelphia data.
tm_shape(acsTractsPHL.2016.sf_Albers,
projection = st_crs(acsTractsPHL.2016.sf_Albers)) +
tm_fill() +
tm_borders() +
tm_layout(title= 'USA Contiguous Albers',
title.position = c('left', 'top'))
Finally, we can easily project our data back to a GCS, this time using the standard EPSG:4326 which will allow us to map in many common frameworks like leaflet.
acsTractsPHL.2016.sf_WGS84 <- acsTractsPHL.2016.sf %>%
st_transform(crs = "EPSG:4326")
st_crs(acsTractsPHL.2016.sf_WGS84)
## Coordinate Reference System:
## User input: EPSG:4326
## wkt:
## GEOGCRS["WGS 84",
## DATUM["World Geodetic System 1984",
## ELLIPSOID["WGS 84",6378137,298.257223563,
## LENGTHUNIT["metre",1]]],
## PRIMEM["Greenwich",0,
## ANGLEUNIT["degree",0.0174532925199433]],
## CS[ellipsoidal,2],
## AXIS["geodetic latitude (Lat)",north,
## ORDER[1],
## ANGLEUNIT["degree",0.0174532925199433]],
## AXIS["geodetic longitude (Lon)",east,
## ORDER[2],
## ANGLEUNIT["degree",0.0174532925199433]],
## USAGE[
## SCOPE["Horizontal component of 3D system."],
## AREA["World."],
## BBOX[-90,-180,90,180]],
## ID["EPSG",4326]]
tm_shape(acsTractsPHL.2016.sf_WGS84,
projection = st_crs(acsTractsPHL.2016.sf_WGS84)) +
tm_fill() +
tm_borders() +
tm_layout(title= 'WGS84 - 4326',
title.position = c('left', 'top'))
Oftentimes we make data, or receive data, in a format that does not carry any CRS information. Hopefully you know what the CRS is or can ask the person who sent it. Failing that, you can search for documentation on your data source. The last resort is trying a number of different projections until your data line up the way you expect, which can be quite a challenge.
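One distinction worth sketching before we assign a CRS: st_set_crs() only labels existing coordinates with a CRS, while st_transform() actually recomputes them into a new CRS. Mixing the two up is a classic source of data that lands in the wrong place on a map. The point below is made up:

```r
library(sf)

# a made-up point whose CRS was never recorded (e.g. fresh from a CSV)
pt <- st_sfc(st_point(c(-75.16, 39.95)))  # CRS is NA

# st_set_crs() labels the coordinates; the numbers do not change
pt_labeled <- st_set_crs(pt, "EPSG:4326")

# st_transform() recomputes the coordinates into a new CRS
pt_utm <- st_transform(pt_labeled, "EPSG:26918")

st_coordinates(pt_labeled)  # same numbers we started with
st_coordinates(pt_utm)      # new numbers: eastings/northings in metres
```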
Here we will create some data for Philadelphia. This is an example as if you just read in a CSV file of points.
PHL_data <- data.frame(point_ID = seq(1,300,1),
variable1 = rnorm(300,0,1)) %>%
mutate(latitude = sample(seq(39.852,40.052,0.001),n(), replace = TRUE),
longitude = sample(seq(-75.265,-75.065,0.001),n(), replace = TRUE))
head(PHL_data)
## point_ID variable1 latitude longitude
## 1 1 1.3982972 39.896 -75.098
## 2 2 0.6425140 39.950 -75.147
## 3 3 -0.1128888 39.970 -75.176
## 4 4 -0.2124540 40.020 -75.218
## 5 5 0.2063796 40.012 -75.129
## 6 6 0.2751510 40.028 -75.135
In this case we know that the CRS is 4326. We can assign the CRS with the st_as_sf() function, using the coords argument to specify the columns with longitude and latitude (in that order!) and the crs argument with the "EPSG:4326" designation.
PHL_data.sf <- PHL_data %>%
st_as_sf(coords = c("longitude", "latitude"),
crs = "EPSG:4326")
Combining the points with our ACS tract outlines works because both objects have the same CRS of 4326.
tm_shape(acsTractsPHL.2016.sf_WGS84) +
tm_borders() +
tm_shape(PHL_data.sf) +
tm_symbols(col = "red") +
tm_legend(show = FALSE) +
tm_layout(title= 'WGS84 - 4326',
title.position = c('left', 'top'))
Finally, another common operation is to remove the geometry, either to do non-spatial analytics or to transform it in some way and join it back to the data.

Removing the geometry altogether is as easy as using st_drop_geometry(). This leaves you with a plain data.frame with all aspects of the spatial data removed, including the geometry column.
acsTractsPHL.2016_nonsf <- st_drop_geometry(acsTractsPHL.2016.sf)
Another very common thing is to get the centroids of the polygons.
acsTractsPHL.2016_centroid <- acsTractsPHL.2016.sf %>%
st_centroid()
## Warning in st_centroid.sf(.): st_centroid assumes attributes are constant over
## geometries of x
head(acsTractsPHL.2016_centroid)
## Simple feature collection with 6 features and 11 fields
## Geometry type: POINT
## Dimension: XY
## Bounding box: xmin: -75.23167 ymin: 39.96808 xmax: -75.13394 ymax: 39.9918
## Geodetic CRS: NAD83
## GEOID NAME
## 1 42101010200 Census Tract 102, Philadelphia County, Pennsylvania
## 2 42101011900 Census Tract 119, Philadelphia County, Pennsylvania
## 3 42101013900 Census Tract 139, Philadelphia County, Pennsylvania
## 4 42101015700 Census Tract 157, Philadelphia County, Pennsylvania
## 5 42101016300 Census Tract 163, Philadelphia County, Pennsylvania
## 6 42101016800 Census Tract 168, Philadelphia County, Pennsylvania
## total_pop.2016 total_HU.2016 total_vacant.2016 med_HH_Income.2016
## 1 3008 1483 225 16071
## 2 4738 2509 407 30854
## 3 2960 1342 218 14314
## 4 2688 1128 238 38991
## 5 3710 1252 173 14017
## 6 3605 1940 405 17250
## total_White.2016 total_GradDeg.2016 geometry
## 1 0 145 POINT (-75.23167 39.96808)
## 2 47 175 POINT (-75.22881 39.98613)
## 3 466 74 POINT (-75.17112 39.97499)
## 4 1485 147 POINT (-75.13552 39.97804)
## 5 1605 10 POINT (-75.13394 39.9885)
## 6 154 99 POINT (-75.16762 39.9918)
## vacancyPct.2016 pctWhite.2016 mtAiry
## 1 0.1517195 0.000000000 REST OF PHILADELPHIA
## 2 0.1622160 0.009919797 REST OF PHILADELPHIA
## 3 0.1624441 0.157432432 REST OF PHILADELPHIA
## 4 0.2109929 0.552455357 REST OF PHILADELPHIA
## 5 0.1381789 0.432614555 REST OF PHILADELPHIA
## 6 0.2087629 0.042718447 REST OF PHILADELPHIA
tm_shape(acsTractsPHL.2016.sf_WGS84) +
tm_borders() +
tm_shape(acsTractsPHL.2016_centroid) +
tm_symbols(col = "red") +
tm_legend(show = FALSE) +
tm_layout(title= 'Polygon Centroids',
title.position = c('left', 'top'))
Quite often we want a list of the centroid coordinates, either to analyze the coordinates themselves or to join to other data. The st_coordinates() function makes that quite easy! Running it on the centroid data leaves us with a simple matrix of X and Y coordinates.
acsTractsPHL.2016_coords <- acsTractsPHL.2016_centroid %>%
st_coordinates()
head(acsTractsPHL.2016_coords)
## X Y
## 1 -75.23167 39.96808
## 2 -75.22881 39.98613
## 3 -75.17112 39.97499
## 4 -75.13552 39.97804
## 5 -75.13394 39.98850
## 6 -75.16762 39.99180
Sometimes we need to do operations on tables or non-spatial data where the geometry column gets in the way. In this case you can sometimes drop the geometry, do what is needed, and then bind it back. A note of caution here: when you cbind(), there are no checks that the rows of the two data sets are in the same order. They have to have the same number of rows, but if your analysis re-sorted or re-arranged one data set, the coordinates of the other will still bind, just to the wrong rows. This is why checking your work with plots and maps is always a good idea!
acsTractsPHL_Income_coords <- st_drop_geometry(acsTractsPHL.2016.sf) %>%
select(GEOID, med_HH_Income.2016) %>%
cbind(acsTractsPHL.2016_coords)
head(acsTractsPHL_Income_coords)
## GEOID med_HH_Income.2016 X Y
## 1 42101010200 16071 -75.23167 39.96808
## 2 42101011900 30854 -75.22881 39.98613
## 3 42101013900 14314 -75.17112 39.97499
## 4 42101015700 38991 -75.13552 39.97804
## 5 42101016300 14017 -75.13394 39.98850
## 6 42101016800 17250 -75.16762 39.99180
A safer way to add the coordinate columns, retaining them alongside the index column GEOID without the fear of accidentally reordering the rows, is to use dplyr::mutate. This assigns the coordinate columns inline and guarantees that they match.
GEOID_Coords <- acsTractsPHL.2016_centroid %>%
dplyr::mutate(longitude = sf::st_coordinates(.)[,1],
latitude = sf::st_coordinates(.)[,2]) %>%
st_drop_geometry() %>%
select(GEOID, latitude, longitude)
head(GEOID_Coords)
## GEOID latitude longitude
## 1 42101010200 39.96808 -75.23167
## 2 42101011900 39.98613 -75.22881
## 3 42101013900 39.97499 -75.17112
## 4 42101015700 39.97804 -75.13552
## 5 42101016300 39.98850 -75.13394
## 6 42101016800 39.99180 -75.16762
Finally, with the GEOID_Coords data.frame above, we have a simple table that has our ID column, GEOID, and X/Y point coordinates. This is not itself an sf object, but it contains spatial information in coordinate form. With this we can turn any other non-spatial table into an sf object, as long as it has the GEOID column.

The example below pretends that new_nonspatial_data is some data we just loaded that has the GEOID but no additional spatial information. We go on to modify that table with a filter and a sort so that it no longer has the same number of rows or the same order as before. Our approach of using cbind() to add the geometry would fail here. Fortunately, we can use the power of table joins. We use left_join() to join only the lat/lon coordinates whose GEOID matches a row in the filtered/sorted table, specifying the join field with by = "GEOID". Then we use st_as_sf() as before to turn the table into an sf object with the correct CRS. Finally, we use st_transform() to reproject it to a coordinate system for mapping with our other data.
new_nonspatial_data <- st_drop_geometry(acsTractsPHL.2016.sf)
new_nonspatial_data_filter <- new_nonspatial_data %>%
filter(vacancyPct.2016 > 0.3) %>%
arrange(desc(med_HH_Income.2016))
filter_data_joined_sf <- new_nonspatial_data_filter %>%
left_join(GEOID_Coords, by = "GEOID") %>%
st_as_sf(coords = c("longitude", "latitude"),
crs = "EPSG:4269") %>%
st_transform(crs = "EPSG:4326")
head(filter_data_joined_sf)
## Simple feature collection with 6 features and 11 fields
## Geometry type: POINT
## Dimension: XY
## Bounding box: xmin: -75.20931 ymin: 39.9333 xmax: -75.1509 ymax: 40.01119
## Geodetic CRS: WGS 84
## GEOID NAME
## 1 42101012203 Census Tract 122.03, Philadelphia County, Pennsylvania
## 2 42101020102 Census Tract 201.02, Philadelphia County, Pennsylvania
## 3 42101015101 Census Tract 151.01, Philadelphia County, Pennsylvania
## 4 42101003200 Census Tract 32, Philadelphia County, Pennsylvania
## 5 42101020000 Census Tract 200, Philadelphia County, Pennsylvania
## 6 42101016902 Census Tract 169.02, Philadelphia County, Pennsylvania
## total_pop.2016 total_HU.2016 total_vacant.2016 med_HH_Income.2016
## 1 643 742 384 38438
## 2 3083 1678 531 32066
## 3 2317 1165 388 30051
## 4 4631 2268 709 27256
## 5 1327 766 241 25066
## 6 5797 2912 954 24802
## total_White.2016 total_GradDeg.2016 vacancyPct.2016 pctWhite.2016
## 1 312 135 0.5175202 0.485225505
## 2 8 68 0.3164482 0.002594875
## 3 74 16 0.3330472 0.031937851
## 4 392 109 0.3126102 0.084646945
## 5 50 85 0.3146214 0.037678975
## 6 88 68 0.3276099 0.015180266
## mtAiry geometry
## 1 REST OF PHILADELPHIA POINT (-75.20931 40.00757)
## 2 REST OF PHILADELPHIA POINT (-75.15453 40.01119)
## 3 REST OF PHILADELPHIA POINT (-75.18575 39.98696)
## 4 REST OF PHILADELPHIA POINT (-75.18542 39.9333)
## 5 REST OF PHILADELPHIA POINT (-75.1509 40.00253)
## 6 REST OF PHILADELPHIA POINT (-75.18319 39.99398)
tm_shape(acsTractsPHL.2016.sf_WGS84) +
tm_borders() +
tm_shape(filter_data_joined_sf) +
tm_symbols(col = "red") +
tm_legend(show = FALSE) +
tm_layout(title= 'Polygon Centroids',
title.position = c('left', 'top'))
Spatial joins are at the foundation of doing spatial analysis. Typical table joins allow us to logically combine columns based on an identifier; we used that approach above and then used st_as_sf() to make the result spatial. This works well for point coordinates and when the filtering is not spatial. In cases where the filtering, sorting, and summarizing operations are spatial, we need spatial joins. This exercise ends with a quick spatial-join example, but the following exercises and assignments require spatial operations like this to replicate typical GIS functionality.

The objective below is to find the average of variable1 from our made-up point data PHL_data.sf for each ACS tract polygon that contains a point. We achieve this by using st_join() to intersect the polygon features with the points. The resulting table will have more rows than the original acsTractsPHL.2016.sf because several tracts contain multiple points and are therefore duplicated, once for each of the variable1 values of PHL_data.sf. Imagine that PHL_data.sf contains crime incidents or traffic counts and we want to aggregate them to the tract polygons.
ACS_polys_with_pnts <- acsTractsPHL.2016.sf %>%
st_join(st_transform(PHL_data.sf, crs = "EPSG:4269"))
paste0("original rows:",nrow(acsTractsPHL.2016.sf),", joined rows:", nrow(ACS_polys_with_pnts))
## [1] "original rows:384, joined rows:465"
Next we do our summarization, or other geoprocessing that aggregates values to the group level, in this case the GEOID. Note that ACS_polys_with_pnts, the data used in this operation, is an sf object. We are no longer dropping geometry and then joining it back: the {sf} package is smart enough to do all sorts of mutations and summaries while keeping the spatial information intact and correctly computed. It is quite amazing! Below, the ACS_polys_with_pnts table is filtered so only the polygons with variable1 values are left; these are the polygons that had at least one point within them. We group by GEOID so that whatever happens next happens for each group of IDs. Finally, we use summarise() to run an operation on each group, in this case taking the mean of the values in the variable1 column.
ACS_poly_pnts_summ <- ACS_polys_with_pnts %>%
filter(!is.na(variable1)) %>%
group_by(GEOID) %>%
summarise(mean_variable1 = mean(variable1))
head(ACS_poly_pnts_summ)
## Simple feature collection with 6 features and 2 fields
## Geometry type: POLYGON
## Dimension: XY
## Bounding box: xmin: -75.19022 ymin: 39.92468 xmax: -75.14147 ymax: 39.96003
## Geodetic CRS: NAD83
## # A tibble: 6 × 3
## GEOID mean_variable1 geometry
## <chr> <dbl> <POLYGON [°]>
## 1 42101000100 0.643 ((-75.15234 39.94942, -75.14919 39.94903, -75.1440…
## 2 42101000300 -0.588 ((-75.17913 39.958, -75.17972 39.95616, -75.17753 …
## 3 42101001002 -0.106 ((-75.15075 39.94182, -75.1476 39.94142, -75.14406…
## 4 42101001700 -0.442 ((-75.15329 39.93774, -75.15176 39.93733, -75.1502…
## 5 42101002701 -0.521 ((-75.15609 39.9251, -75.15453 39.92489, -75.15294…
## 6 42101003200 0.517 ((-75.19022 39.93078, -75.18864 39.93058, -75.1852…
Here is a map of the mean variable1 values aggregated up to the tract polygons.
tm_shape(acsTractsPHL.2016.sf) +
tm_borders(col = "black") +
tm_shape(ACS_poly_pnts_summ) +
tm_polygons("mean_variable1", midpoint = 0) +
tm_legend(show = TRUE) +
tm_layout(title= 'Mean Variable1',
title.position = c('left', 'top'))